APPLICATIONS OF NEURAL NETWORK MODELS IN
AUTOMATIC SPEECH RECOGNITION


1.  INTRODUCTION 

There  has  long  been  an  interest  among  biologists, 
neurophysiologists,  and  experimental  psychologists,  in  the 
manner  in  which  the  operation  of  an  individual  neuron  can 
be  made  to  account  for  the  behavior  of  the  entire  nervous 
system  or  brain.  There  has  been  a  parallel  and  overlapping 
interest  among  computer  scientists  in  demonstrating  how 
complex behavior can emerge from primitive computing
elements. These interests, taken in the aggregate,
constitute an area variously known as self-organizing
networks, neural networks, learning networks, or
associative memories.

Most of the study in this area has been theoretical.
Much has been done by computer simulation. It has long
been accepted that in order to demonstrate significant
"intelligent" behavior, or behavior resulting from a high
level of organization, one would need the parallel
operation of millions, if not billions, of these computing
elements. Until recently, this was a practical
impossibility. But great advances in VLSI technology have
been made in the past few years, and the current outlook
promises ultra-large-scale integration (ULSI), with even
larger chip sizes and greater circuit densities. Some
researchers have therefore turned to the practical
considerations of implementing learning networks.

This has brought new life to the study of neural
networks, particularly from the point of view of computer
science. Here, the motivation for the study is to
determine principles applicable to the construction of
computers and computing modules for particular
applications, rather than to build faithful models of
biological organisms. In fact, computer scientists are to
a certain degree being driven to neural networks by the
prospect of ULSI, since it is becoming apparent that the
traditional von Neumann architecture and its
multiprocessor generalizations cannot make efficient use
of ULSI circuitry.

In the second section of this report, we review some
of the works that have been published on self-organizing
networks. We summarize the various definitions of
learning, review the research results, and outline
characteristics that distinguish the approaches to neural
network design. Of those studies that were not intended
to be precise models of biological systems, some are
abstract models of pattern recognition, and some are
oriented towards the problems of image recognition. But it
is clear that the principles of neural network design can
be applied to speech recognition as well, as long as the
neural networks are deployed in such a way as to recognize
what is known within the complex structure of speech.

Thus, in the third section, we propose and discuss
the design of a speech recognition system based on neural
networks. We offer some arguments in support of the notion
that recognition of speech is a task well suited to neural
networks, and that this application will benefit from the
earliest practical implementations of the technique.


2.  REVIEW  OF  THE  LITERATURE

In this section, we review some of the reports in the
literature that describe networks of elements that in some
way represent neurons. The elements will be called either
neurons, cells or units, depending on how closely their
properties are intended to represent those of biological
neurons.

The work of Hopfield [1] is useful as an introduction
and overview of many neural networks, since it is
purposefully abstract. In his model, a cell has only the
most elementary properties considered fundamental to
computation capability. In particular, the state of cell i
is represented only by its output or firing rate $V_i$.

In biological nervous systems, the average rate at
which a cell fires is a nonlinear (and apparently somewhat
stochastic) function of its inputs. A model in which $V_i$
can take on continuous values is called a graded-response
model [2]. However, in Hopfield's and other elementary
models, $V_i$ takes on only the values zero and one.

The state s of the network is a vector of the states
of the cells. For a network of n cells,
$s = (V_1, V_2, \ldots, V_n)$.

The purpose of most of the reported neural networks
is to recognize patterns from the environment. In one
class of network, the recognition of a pattern is
indicated by the network entering a specific state, or a
repeating sequence of closely related states. Thus, let S
represent a set of designated states, or (in Hopfield's
terms) memories. The network will operate as an
associative memory if it can be initialized with some of
the cells set to the values of a particular memory
$s \in S$, and the rest set randomly, and it will then move
toward and enter the memory s. One might then say it
retrieves complete information from partial information,
or recognizes a complete pattern from a fragmented
pattern.


A biological neuron communicates with another through
a synapse. The strength of the connection from neuron j
to neuron i is a measure of the effectiveness of the
synapse in communicating $V_j$ to influence $V_i$. In more
abstract terms, the strength of the connection from cell j
to cell i is described by the weighting factor $T_{ij}$.


In Hopfield's model, a cell fires if the weighted sum
of its inputs is greater than some threshold $U_i$:

$$V_i \to 1 \ \text{if} \ \sum_{j \ne i} T_{ij} V_j > U_i, \qquad V_i \to 0 \ \text{if} \ \sum_{j \ne i} T_{ij} V_j < U_i. \tag{1}$$

$U_i$ is a threshold value, which could be varied as an
elaboration of the model, but is usually zero.


A plastic synapse is one whose strength is modified
through experience. Modification of synapse strength or
connection weight is the central mechanism for learning in
neural networks. The learning rule describes how the
weights $T_{ij}$ are modified. According to the
neurological model proposed by Hebb [3], the strength of a
synapse is increased only as a result of the simultaneous
firing of the pre- and post-synaptic neurons. Hence a
learning rule which modifies $T_{ij}$ only as a consequence
of simultaneous activation of cells i and j will be termed
a Hebbian rule.


Hopfield's model does not describe how learning takes
place. It assumes that there is a set S of distinguished
network states, each of which represents recognition of a
pattern, and that the weights have been established on the
basis of these states:

$$T_{ij} = \sum_{s \in S} (2V_i^s - 1)(2V_j^s - 1) \tag{2}$$

where $V_i^s$ is the value of cell i in state s, the
summation is over all distinguished states $s \in S$, and
$T_{ii} = 0$.

By  (2),  it  is  seen  that  if  a  distinguished  state  has 
one  cell  active  and  the  other  not,  its  contribution  to  the 
connection  weight  between  the  cells  is  negative.  If  the 
state  has  either  both  cells  active  or  both  inactive,  its 
contribution  is  positive.  This  rule  is  consistent  with  the 
Hebbian  learning  rule,  in  that  the  connection  strength  is 
correlated  with  simultaneous  activity  of  the  cells.  But 
here,  the  connection  strength  is  specified  by  the  intended 
cell  activity,  rather  than  caused  by  that  activity. 
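
The storage rule (2) can be illustrated with a short sketch. The function name `hopfield_weights` and the two example memories are illustrative assumptions, not taken from [1]; the code only restates the sum in (2) with the diagonal held at zero.

```python
def hopfield_weights(memories):
    """Build the weight matrix T of equation (2).

    memories: list of equal-length lists holding 0/1 cell values,
    one list per distinguished state (hypothetical example data).
    """
    n = len(memories[0])
    T = [[0] * n for _ in range(n)]
    for s in memories:
        for i in range(n):
            for j in range(n):
                if i != j:  # T_ii is held at zero
                    # +1 if cells i and j agree in this state, -1 otherwise
                    T[i][j] += (2 * s[i] - 1) * (2 * s[j] - 1)
    return T


memories = [[1, 0, 1, 0], [1, 1, 0, 0]]
T = hopfield_weights(memories)
for row in T:
    print(row)
```

Cells 0 and 3 disagree in both example memories, so their weight is -2; cells 0 and 1 agree in one memory and disagree in the other, so the two contributions cancel.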


Hopfield considers the case in which the weights are
symmetric ($T_{ij} = T_{ji}$) for his analysis. He defines
the energy E of the system by

$$E = -\frac{1}{2} \sum_{i \ne j} T_{ij} V_i V_j \tag{3}$$


The change in overall system energy resulting from
the change $\Delta V_i$ of the state of cell i is

$$\Delta E = -\Delta V_i \sum_{j \ne i} T_{ij} V_j \tag{4}$$


It can be shown that with symmetric weights, and cell
changes given by (1), the energy of the system decreases
with every cell state change.
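
A minimal sketch of the recall process follows, assuming zero thresholds, two hand-picked eight-cell memories, and a seeded random update order (all illustrative choices, not details from [1]). It applies rule (1) one cell at a time and records the energy (3) after every update, so the monotone behavior of the energy can be inspected.

```python
import random


def hopfield_weights(memories):
    """Weights from equation (2), with a zero diagonal."""
    n = len(memories[0])
    return [[0 if i == j else
             sum((2 * s[i] - 1) * (2 * s[j] - 1) for s in memories)
             for j in range(n)] for i in range(n)]


def energy(T, V):
    """Energy of state V from equation (3)."""
    n = len(V)
    return -0.5 * sum(T[i][j] * V[i] * V[j]
                      for i in range(n) for j in range(n) if i != j)


def update_cell(T, V, i, threshold=0):
    """Apply rule (1) to cell i; the cell is left alone on an exact tie."""
    total = sum(T[i][j] * V[j] for j in range(len(V)) if j != i)
    if total > threshold:
        V[i] = 1
    elif total < threshold:
        V[i] = 0


def recall(T, start, sweeps=5, seed=0):
    """Asynchronous updates in random order; returns final state and energies."""
    rng = random.Random(seed)
    V = list(start)
    energies = [energy(T, V)]
    for _ in range(sweeps):
        order = list(range(len(V)))
        rng.shuffle(order)
        for i in order:
            update_cell(T, V, i)
            energies.append(energy(T, V))
    return V, energies


memories = [[1, 1, 1, 1, 0, 0, 0, 0], [1, 1, 0, 0, 1, 1, 0, 0]]
T = hopfield_weights(memories)
# Start from the first memory with its last cell corrupted.
recalled, energies = recall(T, [1, 1, 1, 1, 0, 0, 0, 1])
print(recalled, energies[0], energies[-1])
```

In this small case the corrupted cell is the only one whose weighted input disagrees with its state, so the network returns to the first memory whatever the update order.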

The main result is an estimate of the relationship
between the number of cells n in a network and the number
of designated states N that can reasonably be entered
into it. The equation (2) places no limit on the number of
designated states; however, if N is too large with respect
to n, the network may wander in state space and never come
to rest at a final state, or else come to rest at a state
that has no relation to the initial state.

A simple error measurement is obtained by starting
the network in a state $s \in S$, and observing how often
it will come to rest at s or a state close to it. The
result is that if N exceeds 0.15n, the error rate exceeds
50 percent. That relationship seemed to hold independently
of n. A smaller, but not negligible, error rate could be
obtained with N = 0.1n, or about ten network cells per
designated state.


Networks  of  the  kind  defined  by  Hopfield  can  be 
implemented  in  silicon.  The  weights  on  the  links  are  not 
determined  by  the  operation  of  the  network;  they  could  be 
established  at  the  time  the  chip  is  created,  as  a  ROM. 
While  many  circuits  may  be  required,  the  network  will  be 
insensitive  to  the  failure  of  a  few,  hence  be  amenable  to 
ULSI  or  wafer-scale  integration. 

Silverman, Shaw and Pearson [4] describe another
model in which the network as a whole enters a particular
state in response to an environmental stimulus. In their
model, each element, called a "trion", is intended to be
an abstraction of a cluster of about a hundred neurons.
Each trion may assume one of three possible states: +1, 0
or -1, respectively corresponding to high, average and low
output firing rates. Let g(S) be the initial probability
that a particular trion is in state S. The initial firing
rates are assumed to be such that g(0) >> g(-1), g(1).


The state of the system at time n is related to the
system states at times n-1 and n-2. Let $S_j$, $S'_j$, and
$S''_j$ be the states of the jth trion at times n, n-1,
and n-2 respectively. Then $P_i(S)$, the probability of
the ith trion being in state S at time n, is given by

$$P_i(S) = \frac{g(S) \exp[B M_i S]}{\sum_s g(s) \exp[B M_i s]} \tag{5}$$

where

$$M_i = \sum_j \left[ V_{ij} S'_j + W_{ij} S''_j \right] - V_i^T,$$

$V_i^T$ is a threshold, and B is inversely proportional to
noise. Each trion is influenced by the states of its
neighbors at the previous two time steps. Hence, $V_{ij}$
and $W_{ij}$ are the interaction weights between trion i
(at time n) and trion j at times n-1 and n-2 respectively.
The weights may be positive or negative, corresponding to
excitatory or inhibitory interaction.
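
As a rough sketch of how (5) is evaluated for one trion, the fragment below computes the net input $M_i$ and the three state probabilities. The function names, the prior g, and the example weights and states are hypothetical values chosen only to exercise the formula; they are not taken from [4].

```python
import math


def net_input(V_row, W_row, prev, prev2, threshold=0.0):
    """M_i = sum_j (V_ij * S'_j + W_ij * S''_j) - threshold."""
    return sum(v * s1 + w * s2
               for v, w, s1, s2 in zip(V_row, W_row, prev, prev2)) - threshold


def trion_probabilities(M_i, B, g):
    """Return {S: P_i(S)} for S in (-1, 0, 1), following equation (5)."""
    unnormalized = {S: g[S] * math.exp(B * M_i * S) for S in (-1, 0, 1)}
    Z = sum(unnormalized.values())
    return {S: value / Z for S, value in unnormalized.items()}


# Assumed prior with g(0) >> g(-1), g(1).
g = {-1: 0.05, 0: 0.9, 1: 0.05}

# One trion with three inputs (itself and two neighbors), hypothetical values.
M = net_input(V_row=[1, 1, 1], W_row=[1, -1, 1],
              prev=[1, 0, 1], prev2=[1, 1, 0])
resting = trion_probabilities(0.0, 1.0, g)   # no net input: P equals g
driven = trion_probabilities(M, 5.0, g)      # strong input, low noise
print(M, resting, driven)
```

With no net input the probabilities reduce to the prior g, and with a positive net input and a large B (low noise) the +1 state dominates despite its small prior.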


The network used in the simulations consisted of six
trions arranged in a circle. Each trion interacted with
itself and its two nearest neighbors on the left and
right. The V and W weights were assigned as +1 or -1.

Networks based on various assignments for the weights
were usually found to fall into one of a few stable firing
patterns, consisting of repeating trion firing sequences
of various lengths.

The parameter B plays the role of an inverse
temperature: thermal noise, or random variation in cell
output, grows as B shrinks. As B decreases, the noise
level increases, resulting in occasional random errors.
This may cause the firing sequence to change to a nearby
pattern, which could be interpreted as a form of
associative recall.

The firing patterns that develop in a network with
assigned values for the weights may be enhanced by Hebbian
learning rules of the form

$$\Delta V_{ij} = \epsilon \, S_i(n) \, S_j(n-1) \tag{6}$$

$$\Delta W_{ij} = \epsilon \, S_i(n) \, S_j(n-2) \tag{7}$$

where $\epsilon > 0$.

The network might be considered "naive" before the
Hebbian learning rules are applied, in the sense that the
only connections between cells are those that are
prewired, or genetically determined. Once the learning
rules take effect, the network learns from experience to
choose firing patterns appropriate to a given input
pattern.

In the model described by Ackley, Hinton and
Sejnowski [6], the network output consists of the state of
a relatively small number of higher level cells (which the
authors call units) in a hierarchical architecture. The
novelty of this model is found in the use of thermal noise
as an essential element in the search for optimal global
solutions.


Cell i has output $s_i$, which is either zero or one.
The weight on the connection between cell i and cell j is
$w_{ij}$. The overall energy E of the system is defined by

$$E = -\sum_{i<j} w_{ij} s_i s_j + \sum_i \theta_i s_i \tag{8}$$

where $\theta_i$ is a threshold.


Let $\Delta E_k$ be the energy gap of cell k, that is,
the energy of the state with cell k off minus the energy
of the otherwise identical state with cell k on:

$$\Delta E_k = \sum_i w_{ki} s_i - \theta_k \tag{9}$$

Global energy is minimized by increasing the weights
for co-active cells, a Hebbian rule. If a single cell is
assumed to be constantly on, the threshold term can be
included in the interactions between cells, simplifying
the above equations:

$$E = -\sum_{i<j} w_{ij} s_i s_j \tag{10}$$

$$\Delta E_k = \sum_i w_{ki} s_i \tag{11}$$


Noise is utilized to help the system escape from
local energy minima. The values $s_k$ are not fully
determined by the system state and weights. Instead, there
is a probabilistic decision. Let $P_k$ be the probability
that $s_k$ will be set to one. The cell state transition
is described by

$$P_k = \frac{1}{1 + \exp[-\Delta E_k / T]} \tag{12}$$

As $\Delta E_k$ gets larger, $P_k$ approaches 1. As T gets
larger, $P_k$ approaches 1/2 and the decision becomes
nearly random. Therefore at low T, a relatively small
energy gap is enough to determine the state of the cell,
and at high T, a large $\Delta E_k$ is needed to bring
$P_k$ close to one.

At equilibrium,

$$\frac{P_a}{P_b} = \exp[-(E_a - E_b)/T] \tag{13}$$

where $P_a$ and $P_b$ are the probabilities of the global
states a and b.

Low  T  favors  states  with  low  energy,  but  the  rate  at 
which  the  optimal  state  is  approached  is  slow.  High  T 
favors  low  energy  less  strongly,  but  the  time  to  reach 
equilibrium  is  reduced. 
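
The effect of temperature in (12) and (13) can be checked numerically with a small sketch. The helper names and the sample energy gap are illustrative assumptions, and the gap is taken here as the energy with the cell off minus the energy with it on, which is an assumed sign convention.

```python
import math


def p_on(delta_E, T):
    """Probability that a cell is set to one, equation (12)."""
    return 1.0 / (1.0 + math.exp(-delta_E / T))


def state_ratio(E_a, E_b, T):
    """Equilibrium ratio P_a / P_b of two global states, equation (13)."""
    return math.exp(-(E_a - E_b) / T)


gap = 2.0  # hypothetical energy gap favoring the "on" state
cold = p_on(gap, T=0.5)
hot = p_on(gap, T=10.0)
print(round(cold, 3), round(hot, 3))

# For a single cell, the odds of "on" versus "off" match the ratio in (13):
# the "on" state has the lower energy, E_on - E_off = -gap.
odds = cold / (1.0 - cold)
print(odds, state_ratio(-gap, 0.0, T=0.5))
```

At T = 0.5 the gap of 2 sets the cell almost deterministically, while at T = 10 the same gap gives a probability only slightly above one half.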

The information gain of the system is defined by

$$G = \sum_a P(V_a) \ln\left[ \frac{P(V_a)}{P'(V_a)} \right] \tag{14}$$

where $P(V_a)$ represents the probability of the network
being in state a with the network "clamped" (the states of
some of the cells are fixed by the input pattern), and
$P'(V_a)$ is the same probability when the network is free
running. For example, if $P(V_a) = P'(V_a)$ for all a, the
information gain G is zero.
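
A minimal sketch of the information gain computation follows, assuming the clamped and free-running state probabilities are available as two aligned lists. The function name and the example distributions are hypothetical.

```python
import math


def information_gain(p_clamped, p_free):
    """G = sum_a P(V_a) * ln(P(V_a) / P'(V_a)).

    Terms with P(V_a) == 0 contribute nothing and are skipped.
    """
    return sum(p * math.log(p / q)
               for p, q in zip(p_clamped, p_free) if p > 0)


same = information_gain([0.5, 0.5], [0.5, 0.5])
different = information_gain([0.5, 0.5], [0.9, 0.1])
print(same, round(different, 4))
```

G is zero when the two distributions agree and grows as the free-running network departs from the clamped statistics.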


Gradient descent is performed by modifying link
weights:

$$\frac{\partial G}{\partial w_{ij}} = -\frac{1}{T} (p_{ij} - p'_{ij}) \tag{15}$$

where $p_{ij}$ and $p'_{ij}$ are the probabilities that
both cells i and j are active when the network is clamped
and free-running, respectively, in thermal equilibrium.
Therefore, the network adapts to stimuli by changing each
link weight in proportion to the difference between
$p_{ij}$ and $p'_{ij}$:

$$\Delta w_{ij} = \epsilon \, (p_{ij} - p'_{ij}) \tag{16}$$

where $\epsilon$ is a constant.


By  the  equations  above,  if  a  pair  of  cells  have  a 
higher  probability  of  being  concurrently  active  when  the 
network  is  clamped  than  when  free  running,  weight  will  be 
added  to  the  link  between  them.  This  will  result  in  a 
decrease  in  system  energy. 
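
The weight change described above can be sketched as follows, under the assumption that the co-activation probabilities are estimated by counting over sampled network states. The sampling scheme, the function names, and the example states are illustrative and are not a description of the simulations in [6].

```python
def coactivation(samples):
    """Estimate, for each pair (i, j), the fraction of sampled states
    in which cells i and j are both on."""
    n = len(samples[0])
    p = [[0.0] * n for _ in range(n)]
    for s in samples:
        for i in range(n):
            for j in range(n):
                if i != j and s[i] and s[j]:
                    p[i][j] += 1.0 / len(samples)
    return p


def weight_increments(clamped_samples, free_samples, eps=0.1):
    """Increment each weight by eps * (clamped - free co-activation)."""
    p = coactivation(clamped_samples)
    q = coactivation(free_samples)
    n = len(p)
    return [[eps * (p[i][j] - q[i][j]) for j in range(n)] for i in range(n)]


# Hypothetical equilibrium samples for a three-cell network.
clamped = [[1, 1, 0], [1, 1, 0], [1, 0, 1], [1, 1, 1]]
free = [[1, 0, 0], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
dw = weight_increments(clamped, free)
print(dw[0][1])
```

Cells 0 and 1 are co-active in three of the four clamped samples but only one of the four free-running samples, so the link between them is strengthened.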


The weight increment $\Delta w_{ij}$ is determined
entirely on the basis of locally available information,
while affecting the global energy level. Minimizing G is
the process of the network capturing regularities in input
patterns.

The authors provided results of simulations involving
two-layered networks of various sizes. When seeking
equilibrium, the system was run at a series of decreasing
temperatures, so that the global energy minimum was
approached in a reasonably short period of time, while
still avoiding capture at local energy minima. The
concept of slow reduction of temperature in order to
achieve an optimum result is known as annealing. Its use
in optimization algorithms in general is discussed in [7].

Because the probability distribution used to
determine state changes was the Boltzmann distribution,
parameterized by temperature, the network was called the
Boltzmann machine. It was run to solve a series of binary
encoding problems. As the complexity of the problems
increased, the number of learning cycles required to solve
the problem also increased, and in some cases the best
solution was not found.


Rumelhart and Zipser [7] also describe a multilayered
hierarchical architecture with Hebbian learning rules. In
their model, the inputs to a cell at one layer come from
the cells at the layer below. The weights associated with
the inputs to a particular cell sum to one.

The  network  includes  the  powerful  feature  of  lateral 
inhibition,  in  which  an  active  cell  can  inhibit  the  cells 
at  the  same  level.  In  the  model  of  [7],  cells  are  arranged 
in  clusters.  All  cells  of  a  cluster  receive  the  same 
input  signals,  but  each  cell  weights  them  differently.  The 
cells  output  either  zero  or  one.  The  cell  with  the  largest 
weighted  input  in  the  cluster  outputs  one;  all  others  have 
zero  output.  The  mechanism  of  the  cluster  may  be  thought 
of  as  a  competition,  in  which  the  cell  that  wins  the  right 
to  output  also  inhibits  the  other  cells  of  the  cluster. 
The  cells  at  the  lowest  level  of  the  hierarchy  represent 
the  input  pattern  or  stimulus. 

The learning rule for this model is the following.
The weight on the input to cell j from cell i on the lower
level is $W_{ij}$. If pattern $S_k$ is presented to the
network, let $c_{ik}$ be the output of cell i (on the
lower level). For each pattern $S_k$, the weights on the
inputs to a cell are modified only if the cell wins the
competition within its cluster. That is, with pattern
$S_k$,

$$\Delta W_{ij} = 0 \quad \text{if cell } j \text{ loses,}$$

and

$$\Delta W_{ij} = g \, \frac{c_{ik}}{n_k} - g \, W_{ij} \quad \text{if cell } j \text{ wins.} \tag{17}$$

Here, $n_k$ is the number of active cells in pattern
$S_k$, and g is a constant.

Let $v_{jk}$ be the probability of cell j winning on
presentation of stimulus $S_k$, and let $p_k$ be the
probability of $S_k$ being presented on a given trial. At
equilibrium,

$$\sum_k \Delta W_{ij} \, v_{jk} \, p_k = 0. \tag{18}$$

As in the works described above, the authors define
a global energy term. In this case, the parameter T
quantifies system stability, and hence is the negative of
energy,

$$T = \sum_k p_k \sum_{j,i} v_{jk} (\alpha_{jk} - \alpha_{ik}) \tag{19}$$

where

$$\alpha_{jk} = \sum_i W_{ij} c_{ik}.$$

T is the amount by which the weighted input to winning
cells exceeds the weighted input to all other cells,
averaged over all stimuli. Since T is the negative of
energy, it must be maximized.


A weakness of the learning rule based on competition
is that if a cell's original (naive) weights are not
related to any stimulus, it may never win the competition,
hence never learn. A modification of the learning rule
permits "leaky learning" in which all link weights are
modified:

$$\Delta W_{ij} = g_l \, \frac{c_{ik}}{n_k} - g_l \, W_{ij} \quad \text{if cell } j \text{ loses on } S_k,$$

and

$$\Delta W_{ij} = g_w \, \frac{c_{ik}}{n_k} - g_w \, W_{ij} \quad \text{if cell } j \text{ wins on } S_k, \tag{20}$$

where $g_w \gg g_l$. With this rule, cells that constantly
lose the competition move slowly towards the active
patterns, so as to eventually win.
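
One step of competitive learning within a single cluster can be sketched as below. The data layout (one weight row per cluster cell), the function names, and the rates are assumptions made for illustration; setting the losing rate `g_l` to zero gives the basic rule, and a small positive `g_l` gives leaky learning. The sketch normalizes by the number of active inputs in both cases and assumes the pattern has at least one active input.

```python
def winner(weights, pattern):
    """Index of the cluster cell with the largest weighted input."""
    sums = [sum(w * c for w, c in zip(row, pattern)) for row in weights]
    return sums.index(max(sums))


def competitive_step(weights, pattern, g_w=0.1, g_l=0.0):
    """Update weights in place for one presented pattern.

    weights[j][i] is the weight into cluster cell j from input cell i.
    The winner moves toward the pattern at rate g_w, losers at rate g_l.
    """
    n_k = sum(pattern)  # number of active input cells
    win = winner(weights, pattern)
    for j, row in enumerate(weights):
        g = g_w if j == win else g_l
        for i, c in enumerate(pattern):
            row[i] += g * c / n_k - g * row[i]
    return win


weights = [[0.4, 0.3, 0.2, 0.1], [0.1, 0.2, 0.3, 0.4]]
win = competitive_step(weights, [1, 1, 0, 0])
print(win, weights)
```

Because each update adds a total of g and removes a total of g from a row, the weights into every cell continue to sum to one.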

The authors describe simulations of networks with
varying numbers of input cells on the lower level, and one
cluster of two units on the upper level. A few letter
recognition experiments worked predictably well.
Difficulties were exposed in the attempts to train the
upper level cells to distinguish between vertical and
horizontal lines. Each of the pattern sets that should be
recognized as similar (the set of vertical lines and the
set of horizontal lines) consists of disjoint elements:
parallel lines do not intersect. But by the pattern
recognition scheme, patterns are recognized as similar by
the number of points they have in common. Therefore,
because of the single point they have in common, a
vertical line and a horizontal line are considered more
alike than any pair of horizontal or vertical lines. The
result of exciting the network with a series of vertical
lines and horizontal lines is that the upper levels
systematically discard exactly what they are intended to
capture.

The authors demonstrate that their model can be
trained to capture the idea of vertical vs. horizontal in
the following way. The higher level clusters were
increased from two to four cells each, and a third level,
with a single cluster of two units, was added. Then each
time a vertical line was presented on the right side of
the input matrix, it was accompanied by a vertical line in
the leftmost column. Similarly, horizontal lines were
accompanied by a horizontal line in the uppermost row. The
vertical "training patterns" had more points in common
with each other than with horizontal training patterns,
and so the level two units easily learned to distinguish
between them. With four cells on the second level, two
would develop weights that would allow the recognition of
vertical training patterns: they divided this set between
themselves. Two cells would similarly recognize horizontal
training patterns. When the training lines were removed
from the input patterns, the remaining parts of the
patterns were recognized in the same way. And since there
are two second-level clusters, the third level would
always have two out of eight inputs active, from which it
would easily distinguish the vertical from the horizontal
patterns.


In the work of Grossberg [8], an "on-center/off-surround"
architecture is described. Each cell receives, in
addition to its own (center) input, weighted negative
inputs which are the center inputs to neighboring cells.
The cells are organized in layers; layer one, or $V_1$,
consists of cells $v_{11}, v_{12}, \ldots, v_{1n}$. The
output of cell $v_{1i}$ is the continuous variable
$x_{1i}$. (This is a graded-response model.) At level one,
the cells are excited by an external pattern. $I_i$ is the
part of the pattern presented to $v_{1i}$ to compute
$x_{1i}$.

The equation governing output by a given neuron is

$$\frac{dx_{1i}}{dt} = -A x_{1i} + (B - x_{1i}) I_i - x_{1i} \sum_{k \ne i} I_k \tag{21}$$

where $B > x_{1i}$.

The first term in (21) specifies exponential decay
of the output, based on the constant A. This permits
gradual loss of memory.

The second term is the "on-center" part of the
expression. If $I_i$ is large, and $x_{1i} \ll B$, then
$dx_{1i}/dt$ is strongly positive, and $x_{1i}$ becomes
larger.

The third term is the "off-surround". It decreases
the output of $v_{1i}$ in proportion to the sum of the
input pattern surrounding that part of the pattern
presented directly to $v_{1i}$.

Let $I = \sum_k I_k$ and let $Q_i = I_i I^{-1}$. At
equilibrium,

$$x_{1i} = Q_i \, \frac{B I}{A + I} \tag{22}$$

The internal network outputs are adjusted to be relative
to the intensity of the input pattern. This mechanism
works like an automatic gain control.

To store patterns after input ceases, a reverberating
architecture is proposed. Cells on level two excite and
inhibit level three cells by the on-center/off-surround
technique. Level three cells in turn excite the level two
cells in the same way. This may be described by the
equation

$$\frac{dx_{2j}}{dt} = -A x_{2j} + (B - x_{2j}) \left[ f(x_{2j}) + I_{2j} \right] - x_{2j} \sum_{k \ne j} f(x_{2k}) \tag{23}$$

where f(w) is a feedback signal produced by average
activity w. In (23), the on-center term is related to
both the original input pattern and the feedback signal
from level three. The off-surround term is from units on
the third level. Since these units are excited by
neighbors of $v_{2j}$, the cell is indirectly inhibited by
its neighbors.


Let $Z_{ij}$ be the link weight from $v_{1i}$ to
$v_{2j}$, and $D_{ij}$ be the signal from $v_{1i}$ to
$v_{2j}$. The learning rule is

$$\frac{dZ_{ij}}{dt} = -c_{ij} Z_{ij} + D_{ij} x_{2j} \tag{24}$$

where $c_{ij}$ is the decay rate of $Z_{ij}$. The second
term of the above expression indicates that the learning
rule is Hebbian; the weights are increased whenever the
pre- and post-synaptic cells are active.


The model offers two possible scenarios for pattern
capture: choice, and partial contrast.

choice:

$$x_{2j} = \begin{cases} 1 & \text{if } S_j > \max[\epsilon, S_k : k \ne j] \\ 0 & \text{if } S_j < \max[\epsilon, S_k : k \ne j] \end{cases}$$

contrast:

$$x_{2j} = \begin{cases} f(S_j) \left[ \sum_{S_k > \epsilon} f(S_k) \right]^{-1} & \text{if } S_j > \epsilon \\ 0 & \text{if } S_j < \epsilon \end{cases}$$

In the choice mode, only the cell with the strongest
excitation (above threshold $\epsilon$) becomes active. In
partial contrast, all cells above threshold respond in
proportion to their relative input levels above threshold.
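
The two modes can be contrasted with a small sketch. It assumes the list `S` holds the excitation received by each cell and takes the signal function f to be the identity; both the names and the choice of f are assumptions made for illustration.

```python
def choice(S, eps):
    """Only the single most excited cell above threshold eps responds."""
    out = []
    for j, s in enumerate(S):
        rivals = [eps] + [sk for k, sk in enumerate(S) if k != j]
        out.append(1.0 if s > max(rivals) else 0.0)
    return out


def partial_contrast(S, eps, f=lambda s: s):
    """Every cell above eps responds in proportion to f of its excitation,
    normalized over the cells above threshold."""
    norm = sum(f(sk) for sk in S if sk > eps)
    return [f(s) / norm if s > eps else 0.0 for s in S]


S = [0.2, 0.9, 0.5]
picked = choice(S, eps=0.3)
graded = partial_contrast(S, eps=0.3)
print(picked, [round(v, 3) for v in graded])
```

With these values the choice mode activates only the second cell, while partial contrast divides the response between the second and third cells and suppresses the first.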

There is a noteworthy degree of underlying similarity
between Grossberg's model and the trion model of Silverman
et al. In Grossberg's model, firing patterns reverberate
(and are enhanced) between levels two and three of the
network. In the trion model, patterns reverberate
temporally and spatially; each trion may be subject to
excitation or inhibition from itself and its neighbors
(all on a single level) over the past two time periods.
It does not seem unreasonable to suppose that the temporal
aspect of the trion model could in fact just as easily be
represented with a reverberating spatial pattern.


Fukushima [9] describes a multilayered network with
a modified Hebbian learning rule and a partially
stochastic interlayer connection pattern. The learning
rule is the following: connections from cell x to cell y
are reinforced if x fires and y is firing more strongly
than any other post-synaptic cell in its neighborhood.
Among cells in the vicinity of a particular cell, only one
is reinforced at a given time.

This learning rule bears a strong similarity to the
"competitive learning" mechanism of Rumelhart and Zipser
[7], in that only one cell in a particular grouping is
reinforced. But it is not, strictly speaking, a lateral
inhibition mechanism, since the neighboring cells are not
prevented from firing. After learning has taken place,
only the one cell responding to a particular pattern will
fire.

Suppose u(i) and v(i) are the ith excitatory and
inhibitory inputs, respectively, to a particular cell.
Let a(i) and b(i) be the respective conductances
(weights) of these inputs. The rule for determining the
output w of a cell is described as follows.

$$w = f\left[ \frac{1 + \sum_j a(j) u(j)}{1 + \sum_j b(j) v(j)} - 1 \right] \tag{25}$$

where f[x] = 0 for x < 0 and f[x] = x for x >= 0.

Suppose e and h respectively represent the total
excitatory and inhibitory inputs to a cell:

$$e = \sum_j a(j) u(j) \quad \text{and} \quad h = \sum_j b(j) v(j).$$

Then (25) can be rewritten as

$$w = f\left[ \frac{1 + e}{1 + h} - 1 \right] = f\left[ \frac{e - h}{1 + h} \right] \tag{26}$$

Then the gain control mechanism for the cell output w can
be seen. If h << 1, w is approximately f[e - h]. If e >> 1
and h >> 1, w is approximately f[(e/h) - 1]. This last
condition is the state of the network after learning has
occurred. The weights a and b increase indefinitely. For
this reason, thresholds do not make sense for this model.
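
The two limiting regimes can be checked with a short sketch of the cell output rule. The function names and the sample inputs and conductances are hypothetical values chosen to land in each regime.

```python
def f(x):
    """Half-wave rectification: f[x] = 0 for x < 0, f[x] = x otherwise."""
    return x if x >= 0 else 0.0


def cell_output(u, a, v, b):
    """Output f[(1 + e) / (1 + h) - 1], where e is the weighted excitatory
    input (a with u) and h the weighted inhibitory input (b with v)."""
    e = sum(aj * uj for aj, uj in zip(a, u))
    h = sum(bj * vj for bj, vj in zip(b, v))
    return f((1.0 + e) / (1.0 + h) - 1.0)


# Weak inhibition (h = 0.01): output is close to e - h = 0.19.
small = cell_output(u=[1.0, 0.5], a=[0.1, 0.2], v=[0.1], b=[0.1])
# Strong excitation and inhibition (e = 300, h = 100): close to e/h - 1 = 2.
large = cell_output(u=[100.0, 100.0], a=[2.0, 1.0], v=[50.0], b=[2.0])
# Inhibition outweighs excitation: output is clipped to zero.
silent = cell_output(u=[1.0, 0.5], a=[0.1, 0.2], v=[1.0], b=[1.0])
print(round(small, 4), round(large, 4), silent)
```

The division by 1 + h is what keeps the output bounded by the ratio of excitation to inhibition even as the weights grow.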

Each cell at level k receives excitatory inputs from
cells in a particular area on the next lower level (its
"connectable area"), and one inhibitory input from the
same area. The index j will span the connectable area.
Then $u_{k-1}(n+j)$ will be an excitatory input to cell n
at level k, from cell n+j at level k-1. The inhibitory
cell on the lower level receives input from the same cells
as excite the upper level cell. Thus, the inhibition of
cell n at level k is the sum of the excitatory inputs to
that cell, multiplied by the (unmodifiable) weights from
excitatory to inhibitory cells within level k-1.

$$v_{k-1}(n) = \sum_j c_{k-1}(j) \, u_{k-1}(n+j) \tag{27}$$

The output $u_k(n)$ of excitatory cell n at level k is
greater than zero if the weighted sum of the excitatory
inputs from the connectable area on level k-1 is greater
than the weighted inhibitory input.

$$u_k(n) = f\left[ \frac{1 + \sum_j a_k(j,n) \, u_{k-1}(n+j)}{1 + b_k(n) \, v_{k-1}(n)} - 1 \right] \tag{28}$$

The equations for modification of excitatory and
inhibitory connection weights are such that if no cell in
the neighborhood of n is firing, all cells are (weakly)
reinforced. If one cell is firing more strongly than the
others, it is the only one to be reinforced.


Let $g_k(n)$ be a function that takes on the value one
if no cell in its vicinity is firing more strongly (i.e.,
$u_k(n) \ge u_k(n+j)$), and the value zero otherwise. If
$u_k(n) = 0$ then

$$\Delta a_k(j,n) = q_0 \, c_{k-1}(j) \, u_{k-1}(n+j) \, g_k(n) \tag{29}$$

$$\Delta b_k(n) = q_0 \, v_{k-1}(n) \, g_k(n) \tag{30}$$

If $u_k(n) > 0$ then

$$\Delta a_k(j,n) = q_1 \, c_{k-1}(j) \, u_{k-1}(n+j) \, g_k(n) \tag{31}$$

$$\Delta b_k(n) = \left[ \frac{\sum_j a_k(j,n) \, u_{k-1}(n+j)}{2 \, v_{k-1}(n)} \right] g_k(n) \tag{32}$$

Excitatory  reinforcement  from  a  cell  on  level  k-1  to  a 
cell  on  level  k  is  a  constant  times  the  output  level  of 
the  level  k-1  cell,  provided  that  the  total  excitation  to 
the level k cell is higher than to any other cell in
its  vicinity.  If  the  level  k  cell  is  receiving  subthres¬ 
hold  excitation,  the  reinforcement  is  weaker  than  if  it 
had  suprathreshold  excitation. 
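
The reinforcement rules of equations (29) through (32) can
be sketched as follows for a one-dimensional layer. The
values of q_0 and q_1 are illustrative assumptions (chosen
with q_0 < q_1 so that subthreshold reinforcement is the
weaker one), the handling of layer edges and of a zero
inhibitory output is an assumption, and all names are
hypothetical.

```python
def g(u_k, n, vicinity):
    """g_k(n): 1 if no cell in the vicinity of n fires more strongly, else 0."""
    neighbours = [n + j for j in vicinity if 0 <= n + j < len(u_k)]
    return 1 if all(u_k[n] >= u_k[m] for m in neighbours) else 0


def reinforce(a, b, c, u_prev, v_prev_n, u_k, n, vicinity, q0=0.1, q1=1.0):
    """Return the updated (a_k(., n), b_k(n)) for cell n at level k.

    a, c     : dicts mapping offset j to a_k(j, n) and c_{k-1}(j)
    b        : inhibitory weight b_k(n)
    u_prev   : outputs of level k-1
    v_prev_n : inhibitory output v_{k-1}(n), as in eq. (27)
    u_k      : outputs of level k
    """
    gn = g(u_k, n, vicinity)
    # Inputs outside the layer are treated as zero (an assumption).
    inputs = {j: (u_prev[n + j] if 0 <= n + j < len(u_prev) else 0.0) for j in a}
    if u_k[n] == 0:
        # Subthreshold case, eqs. (29) and (30).
        da = {j: q0 * c[j] * inputs[j] * gn for j in a}
        db = q0 * v_prev_n * gn
    else:
        # Suprathreshold case, eqs. (31) and (32).
        da = {j: q1 * c[j] * inputs[j] * gn for j in a}
        excitation = sum(a[j] * inputs[j] for j in a)
        db = (excitation / (2.0 * v_prev_n)) * gn if v_prev_n > 0 else 0.0
    return {j: a[j] + da[j] for j in a}, b + db
```

In this sketch a cell that is outperformed by a neighbour
has g_k(n) = 0 and is left unchanged, while in a completely
silent vicinity every cell has g_k(n) = 1 and receives the
weak q_0 reinforcement.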

The  equations  for  reinforcement  of  inhibitory  con¬ 
nections,  similarly,  only  apply  for  the  most  strongly 
excited  level  k  cell  in  its  vicinity.  For  the  subthres¬ 
hold  case,  reinforcement  is  a  constant  times  the  output 
of  the  level  k-1  inhibitory  cell.  For  the  suprathreshold 
case,  the  reinforcement  equation  is  more  complex,  but  has 
the  following  implications. 


Let r(j,n) be the ratio of excitatory to inhibitory
reinforcement:

    r(j,n) = da_k(j,n) / (c_{k-1}(j) \, db_k(n)).

It can then be shown that

    r > 1  if  u_k(n) = 0  and  u_{k-1}(n+j) > v_{k-1}(n+j),  or

           if  u_k(n) > 0  and  u_{k-1}(n+j) > \sum_z c_{k-1}(z) \, u_{k-1}(n+z) / (2 \, v_{k-1}(n)).

The first condition means that if cell n is not receiving
suprathreshold levels of excitation, then only if the
level of output from the excitatory cell on level k-1
exceeds the (weighted) output of all cells in its vicinity
(which is the definition of v_{k-1}) will excitatory
connections from level k-1 to level k be more strongly
reinforced than inhibitory connections. The condition for
u_k(n) > 0 shows that (in the simplified binary case) if a
cell responds more strongly than 1/2, its excitatory
connections are more strongly reinforced than its
inhibitory connection.


The  result  of  these  equations  is  that  cells  in  a 
given  vicinity  responding  most  strongly  to  stimuli  from 
the  previous  layer  have  their  excitatory  connections 
enhanced  while  weakly  responding  cells  have  their  inhibi¬ 
tory  connections  enhanced.  Over  time,  the  network 
evolves  so  that  the  number  of  cells  in  a  vicinity 
responding  to  a  particular  stimulus  approaches  unity. 

Fukushima  discusses  different  strategies  for  inter¬ 
connecting  network  layers.  For  the  case  where  cells  in 
all  layers  have  equal  connectable  areas  on  the  previous 
layer,  many  layers  are  required  to  cover  a  large  area  on 
the lowest layer. If the connectable areas increase with
increasing depth into the network layers, the uppermost
level will be connected to a larger area of the input
layer. The problem with this is that too much overlap
occurs, so that only one or a few cells fire on the
uppermost layer. Fukushima considers this to be
undesirable, since a "concept" should consist of a more
complex pattern. Moreover, no further processing on higher
network levels could occur, since cells on the next higher
level would have only a single firing cell in their
connectable areas at a given time.

Fukushima  uses  bifurcating  excitatory  connections. 
One  goes  directly  to  the  cell  in  the  corresponding  posi¬ 
tion  on  the  next  lower  level,  and  the  other  is  probabil¬ 
istically  connected  in  a  manner  such  that  large  spatial 
deviations are less likely to occur than smaller deviations.
With this method, each cell on the uppermost
level  is  connected  (in  a  somewhat  stochastic  manner)  to 
the  entire  input  field. 

Simulations  were  run  using  four  12x12  layers  with 
letters  as  patterns  at  the  lowest  level.  Fukushima  shows 
how  individual  units  on  the  fourth  layer  capture  the 
entire  pattern  presented  to  the  first  layer,  while  cells 
on  the  third  and  second  levels  capture  smaller  parts  of 
the  pattern. 


class cells. This is a form of lateral inhibition.

The  matrix  functions  as  follows.  The  initial 
transfer  weights  of  all  cells  are  assumed  to  be  small,  so 
that  the  initial  excitation  by  a  pattern  impinging  on  the 
imaging matrix produces only feeble, random firing of the
filter  cells.  But  when  such  firing  occurs,  the  connec¬ 
tions  of  the  active  mosaic  cells  to  the  filter  cells  will 
be  reinforced,  thus  making  the  filter  cells  that  fired 
more  sensitive  to  the  pattern.  Whenever  the  pattern  is 
again  presented,  the  sensitized  filter  cells  fire  at  a 
higher  rate,  eventually  triggering  a  class  cell.  The 
first  class  cell  to  fire  also  causes  the  inhibitory  reset 
cell  to  fire,  thereby  preventing  all  other  class  cells 
from  firing.  Thus,  after  the  Hebbian  conditioning,  only 
one  class  cell  will  be  active  for  a  given  pattern.  But 
the  bifurcating  branch  of  the  class  cell  is  adaptively 
linked  to  the  mosaic  cells.  The  synapses  on  those  mosaic 
cells  that  were  excited  by  the  pattern  are  reinforced. 
Now  consider  the  possibility  that  the  class  cell  that 
responded  to  the  pattern  is  stimulated  by  some  external 
signal,  such  as  might  result  from  some  higher  intellec¬ 
tual  function,  or  just  randomness.  Then  the  original  pat¬ 
tern  will  be  regenerated  in  the  mosaic  cells.  This  is  the 
basis  for  the  function  of  "imagination"  in  this  neural 
network.

Trehub  also  discusses  a  "novelty  detecting"  cell. 
This  cell  slowly  accumulates  the  excitatory  input  from 
the  imaging  matrix.  If  a  class  cell  fires,  the  novelty 
detector  is  reset.  But  if  no  class  cell  responds  to  the 
input  in  a  certain  period  of  time,  the  novelty  cell  would 
reach its threshold and fire. The consequence of the
novelty cell firing would be to lower thresholds
throughout the set of filter cells, allowing an
unmodified filter cell to fire, capturing the pattern.
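
The behavior described for the novelty detecting cell can
be sketched as a leaky-free accumulator with a reset. This
is only an illustration of the description above, not
Trehub's formulation: the class name, the gain and threshold
parameters, and the choice to reset the cell after it fires
are all assumptions.

```python
class NoveltyCell:
    """Accumulates imaging-matrix excitation until a class cell fires."""

    def __init__(self, threshold, gain=0.1):
        self.threshold = threshold  # level at which the novelty cell fires
        self.gain = gain            # small gain, so accumulation is slow
        self.level = 0.0

    def step(self, excitation, class_cell_fired):
        """Advance one time step; return True if the novelty cell fires."""
        if class_cell_fired:
            # A class cell recognized the input: reset the detector.
            self.level = 0.0
            return False
        self.level += self.gain * excitation
        if self.level >= self.threshold:
            # No class cell responded in time: signal novelty.
            self.level = 0.0  # reset after firing (an assumption)
            return True
        return False
```

A caller would respond to a True return value by lowering
the thresholds of the filter cells, as described above.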

Trehub  describes  an  organization  of  special  purpose 
interacting  networks  called  "retinoids",  which  perform 
transformations  such  as  translation,  rotation  and  scaling 
of  the  input  pattern.  Each  of  these  subnetworks  would  be 
activated  by  a  particular  neuron,  and  the  transformed 
pattern  would  then  be  re-presented  to  the  class  cells. 
2.  TECHNICAL  DISCUSSION

OUTLINE  FOR  NEURAL-NETWORK  BASED  SPEECH  RECOGNITION 


Neural  networks  have  been  studied  either  for  the 
abstract  problem  of  pattern  recognition,  in  which  the 
nature  of  the  patterns  is  not  specified,  or  else  for  the 
specific  application  area  of  image  analysis.  None  of  the 
works  surveyed  explicitly  addressed  the  problem  of  speech 
recognition. Yet the body of general and image-oriented
neural network research illuminates several possibilities
for neural-network based recognition systems for speech.
In this section, we will outline such a system. It is
based on both the neural-network research and traditional
speech-recognition systems.

At  the  theoretical  level,  speech  recognition  is  no 
different  than  image  recognition.  A  given  (isolated) 
utterance  can  take  the  form  of  the  two-dimensional  pat¬ 
tern  x(f,t),  representing  the  magnitude  of  the  frequency 
component  f  at  time  t.  But  without  taking  the  specific 
nature  of  the  speech  signal  into  account,  one  is 
presented  with  immense  computational  difficulties.  In 
order  to  make  the  necessary  distinctions  between  utter¬ 
ances,  the  number  of  points  required  for  x(f,t)  would  be 
in  the  thousands,  and  the  number  of  cells  required  for 
successful  recognition  would  be  beyond  the  scope  of  any 
conceivable  implementation.  Furthermore,  the  approach 
would  work  (if  at  all)  only  for  utterances  of  fixed 
length, and would not be generalizable to utterances of
greater  length. 


Yet some of the techniques or network designs used
in image analysis may be helpful in specific subtasks of
the speech recognition problem. Consider the well-known
techniques  of  adjusting  for  variations  in  amplitude  (nor¬ 
malization)  and  time  (dynamic  time  warping)  in  a  speech 
signal  presented  to  a  traditional  recognition  system 
based  on  matching  stored  templates.  There  are  analogous 
operations  for  image  recognition:  rotation,  translation, 
and  scaling.  These  analogous  operations  have  been  stu¬ 
died,  and  neural  networks  have  been  proposed  to  effect 
them.  See,  for  example,  [10].  Variations  of  these  tech¬ 
niques  should  be  accessible  where  needed  for  the  speech 
recognition  system.  The  details  of  these  mechanisms  will 
not be developed in the outline of the speech recognition
system shown here.

The  most  successful  speech  recognition  systems 
employ  the  sort  of  layering  that  is  necessary  to  reduce 
the  amount  of  learning  that  must  be  performed  by  a  sin¬ 
gle  neural  network.  Each  layer  attempts  the  recognition 
of  successively  larger  sound  units,  and  passes  the  condi¬ 
tionally  recognized  items  to  the  next  higher  layer. 

This  layered  organization  is  quite  similar  to  the 
structure  of  the  perceptron  -  an  early  neural  network 
model  [11].  The  perceptron  and  related  models  performed 
a  unidirectional  transmission  of  signals  from  input  layer 
to output layer. They employed supervised learning:
feedback from the external world determined the
modification of link weights. Specifically, weights on the
active  cells  were  increased  if  the  network  produced  the 
correct  response,  and  were  decreased  if  the  response  was 
incorrect.  The  perceptron  was  subjected  to  mathematical 
analysis that showed its power to be less than what had
been expected, and so the research interest moved on to
more general models. [12] However, the experience with
the  perceptron  and  related  unidirectional  hierarchical 
models  showed  the  difficulty  of  effecting  supervised 
learning  in  a  layered  structure.  Whenever  such  a  model 
produces  the  wrong  response,  the  learning,  or  modifica¬ 
tion  of  excitatory  weights,  affects  the  cells  that 
operated  correctly  as  well  as  those  responsible  for  the 
errors.  The  problem  is  that  when  wrong  responses  occur, 
it  is  impossible  to  tell  which  level  is  responsible. 

But  if  the  system  is  layered  in  such  a  way  that  the 
outputs  of  each  layer  are  understandable  units, 
feedback-conditioning  loops  can  be  applied  for  each 
layer.  This  provides  the  basis  for  the  proposed  neural- 
network  based  speech  recognition  system. 

Figure  1  is  an  outline  of  a  neural-network  based 
speech  recognition  system.  The  recognition  system  is 
comprised  of  several  successive  "modules"  of  cells  with 
plastic  links.  Each  of  the  neural  modules  is  comprised  of 
several  layers  of  cells.  The  cell  links  are  modified  in 
the  "supervised  learning"  mode:  the  weights  of  active 
cells  are  either  increased  or  decreased  depending  upon  an 
external  signal  that  indicates  whether  the  response  of 
the  module  was  correct.  Thus,  the  output  of  each  module 
is  converted  into  a  signal  that  can  be  displayed  to  the 
user  of  the  system,  who  provides  the  response. 

The  need  for  presentation  of  the  module  output  to 
the  user  does  not  mean  that  the  module  output  must  itself 
be a unique code indicating the object recognized. It is
the nature of neural networks to process patterns in a
"fuzzy" way. A small increment in the weight of any link
in  the  network  could  result  in  the  variation  of  a  single 
bit  of  the  output:  but  this  should  not  be  considered  a 
different  result.  Hence,  the  patterns  representing  the 
desired  results  of  each  module  should  be  separated  by 
Hamming  distances  of  at  least  three,  allowing  for  one-bit 
error  correction.  (For  integration  of  the  modules  into 
one  fluid  system,  the  separation  of  the  output  codes 
should  be  much  greater.)  The  units  marked  "focus"  in  the 
figure  are  fixed  combinational  logic  whose  purpose  is  to 
provide  a  single  unique  code  for  each  output,  from  the 
fuzzy  variations  produced  by  the  module. 
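
The role of a "focus" unit can be illustrated by
nearest-codeword decoding. The sketch below is an
assumption about how such a unit might behave, written in
software rather than combinational logic; the codebook and
all names are hypothetical. Codes separated by a minimum
Hamming distance d allow up to (d - 1) // 2 bit errors to
be corrected, which is the sense in which a distance of
three gives one-bit error correction.

```python
from itertools import combinations


def hamming(x, y):
    """Number of bit positions in which the integers x and y differ."""
    return bin(x ^ y).count("1")


def min_distance(codes):
    """Smallest pairwise Hamming distance within a set of codewords."""
    return min(hamming(x, y) for x, y in combinations(codes, 2))


def focus(output, codebook):
    """Map a fuzzy module output to a unique label, or None if too far off.

    codebook maps each label to its desired output pattern (an integer).
    """
    t = (min_distance(codebook.values()) - 1) // 2  # correctable bit errors
    best = min(codebook, key=lambda label: hamming(output, codebook[label]))
    return best if hamming(output, codebook[best]) <= t else None


# Hypothetical 5-bit codebook with minimum distance 3 (one-bit correction).
CODEBOOK = {"A": 0b00000, "B": 0b00111, "C": 0b11100}
```

Here focus(0b00101, CODEBOOK) returns "B", since the output
differs from the code for B in one bit, while
focus(0b01010, CODEBOOK) returns None because the output is
at least two bits away from every code.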

It  should  be  emphasized  that  the  training  mode,  with 
the  feedback  loops,  exists  only  to  assist  the  system  to 
make  the  correct  responses,  by  training  its  operation  at 
the  appropriate  level.  In  the  operating  mode,  it  will  not 
be  necessary  for  the  user  to  provide  feedback,  and  the 
modification  of  link  weights  will  be  turned  off,  or 
drastically  attenuated. 

Each  of  the  modules  operates  at  a  different  level  of 
organization  of  the  input  signals.  The  input  signal  is 
digitized,  accumulated  into  frames  (at  intervals  of  5-30 
ms.)  and  filtered  to  obtain  either  a  set  of  LPC  coeffi¬ 
cients  or  a  representation  of  the  frequency  spectrum. 
These,  along  with  an  indication  of  the  amplitude  of  the 
speech  signal,  provide  the  input  to  the  first  neural 
module.
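
The framing step can be sketched as follows. The sketch
assumes non-overlapping frames and uses the root-mean-square
value of each frame as the amplitude indication; both
choices, like the function names, are assumptions, and the
LPC or spectral analysis is not shown.

```python
import math


def frames(samples, sample_rate_hz, frame_ms=10.0):
    """Split a digitized signal into consecutive, non-overlapping frames.

    A trailing partial frame is dropped.
    """
    size = int(sample_rate_hz * frame_ms / 1000.0)
    return [samples[i:i + size] for i in range(0, len(samples) - size + 1, size)]


def rms(frame):
    """Root-mean-square amplitude of one frame."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))
```

At a sampling rate of 8000 Hz, a 10 ms frame holds 80
samples, so 240 samples yield three frames.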

The  first  neural  module  determines  phonemes. 
Phonemes are the linguistic units (or sound elements)
that are defined by their ability to make the critical
distinctions  between  words.  Linguists  identify  about  40 
phonemes  for  English.  A  dictionary  of  phonemes,  and  com¬ 
mon  words  in  which  they  occur,  will  be  made  available  to 
the  user,  so  that  he  can  pronounce  them  (in  his  own 
voice),  and  train  the  system  to  recognize  them.  The 
phoneme  identification  (output  of  the  "focus"  unit)  need 
be  no  more  than  six  bits.  However,  the  phoneme  module 
output  should  be  significantly  greater.  For  example,  a 
sixteen-bit  representation  would  allow  for  a  Hamming  dis¬ 
tance  of  seven  between  phoneme  codes,  providing  three-bit 
error  correction. 
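
The arithmetic relating code separation to error correction
can be checked with a short sketch. A minimum Hamming
distance d corrects (d - 1) // 2 bit errors, and the
sphere-packing (Hamming) bound gives an upper limit on how
many n-bit codes with that property can exist. The bound is
a necessary condition only: passing it does not by itself
show that a suitable set of codes exists. The function names
are hypothetical.

```python
from math import comb


def correctable_errors(d):
    """Bit errors correctable with minimum Hamming distance d."""
    return (d - 1) // 2


def hamming_bound(n_bits, t):
    """Upper limit on the number of n-bit codes that correct t bit errors.

    Each code claims every pattern within t bits of it, and these
    neighbourhoods cannot overlap.
    """
    neighbourhood = sum(comb(n_bits, i) for i in range(t + 1))
    return 2 ** n_bits // neighbourhood
```

For the phoneme module, correctable_errors(7) is 3 and
hamming_bound(16, 3) is 94, so a set of about 40 phoneme
codes does not violate this bound.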

The  second  neural  module  recognizes  linguistic  units 
called syltypes. The syltype, which is defined and used
in  the  Hearsay-II  speech  understanding  system,  represents 
a  class  of  syllables.  [13,14]  The  number  of  distinct  syl- 
types  in  a  given  vocabulary  is  only  a  small  fraction  of 
the  number  of  distinct  syllables  found  in  the  vocabulary. 
The  set  of  syltypes  is  determined  by  grouping  the 
phonemes  into  classes  based  on  similarity  of  sound:  there 
may  be  seven  such  classes.  Then,  for  a  given  vocabulary, 
a  state  transition  diagram  is  developed.  The  states  are 
the  phoneme  classes,  and  the  transitions  through  the 
diagram  (involving  one  to  about  six  phoneme  classes)  are 
the  syltypes. 
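
The reduction from syllables to syltypes can be illustrated
with a small sketch. The phoneme symbols and the four
classes used here are hypothetical stand-ins chosen for
illustration; they are not the classes used in Hearsay-II.

```python
# Hypothetical grouping of phonemes into classes by similarity of sound.
PHONEME_CLASS = {
    "p": "STOP", "t": "STOP", "k": "STOP",
    "s": "FRIC", "f": "FRIC",
    "m": "NASAL", "n": "NASAL",
    "a": "VOWEL", "i": "VOWEL", "u": "VOWEL",
}


def syltype(phonemes):
    """Map a syllable (sequence of phonemes) to its sequence of classes."""
    return tuple(PHONEME_CLASS[p] for p in phonemes)


def distinct_syltypes(syllables):
    """Set of syltypes needed to cover a list of syllables."""
    return {syltype(s) for s in syllables}
```

In this sketch the syllables "tap", "kit" and "pat" all map
to the single syltype (STOP, VOWEL, STOP), which shows why
the number of syltypes is much smaller than the number of
syllables.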


The  relationship  between  the  number  of  syltypes  and 
the  vocabulary  size  is  given  in  [14].  A  1000-word  voca¬ 
bulary  requires  about  250  syltypes.  Thus,  the  output  of 
the  syltype  recognition  module  need  be  only  eight  bits. 
But  for  seven-bit  separation  between  syltype  codes, 
allowing  3-bit  error  correction,  the  output  of  this  unit 
should be 21 bits. The syltype focus unit will then be
able to provide the user with a unique code for the
recognized syltype, and allow the user to return the
conditioning signal. It is assumed that the user will have
an online dictionary of syltypes and associated
syllables, so that he may easily determine whether the
recognition is correct.

Some additional structure, apart from the neural
modules, will be required to resolve the differences in
time scale at each level. The output of the phoneme
module, for example, is a single phoneme, but a syltype
is determined by a sequence of phonemes. Therefore, some
sort of memory is required at the input to the syltype
unit. This memory may be a sequence of one-phoneme
latches, each of which is set upon the appearance of a
phoneme. (See Figure 2.)

One output of the phoneme unit is a "phoneme
recognized" signal, activated whenever there is a
sufficiently strong similarity between the input signal and
the signals of the training set, encoded into the module
link weights. Then successive activations of the "phoneme
recognized" signal cause successive phonemes to be
latched. And, upon the recognition of a syltype, a
"syltype recognized" signal will reset the phoneme counter.
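
The latching scheme can be sketched in software as follows.
This is an illustration of the described signals only, not
the circuit of Figure 2: the class and method names are
hypothetical, the default of six latches is taken from the
upper length given above for a syltype, and ignoring
phonemes that arrive when all latches are full is an
assumption.

```python
class PhonemeLatches:
    """A bank of one-phoneme latches feeding the syltype module."""

    def __init__(self, size=6):
        self.size = size    # number of latches available
        self.latched = []   # phoneme codes latched so far, in order

    def phoneme_recognized(self, code):
        """Called on each "phoneme recognized" signal: latch the next phoneme."""
        if len(self.latched) < self.size:
            self.latched.append(code)

    def syltype_recognized(self):
        """Called on the "syltype recognized" signal: reset the counter.

        Returns the phoneme sequence that was latched before the reset.
        """
        contents, self.latched = self.latched, []
        return contents
```

The sketch also makes the weakness of the scheme visible: a
single missed or spurious call to either method leaves the
latches misaligned with the input for the following syltype.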

The  form  of  synchronization  implied  in  the  structure 
of  Figure  2  is  an  attempt  to  overcome  one  of  the  greatest 
difficulties  of  the  neural  network  approach  to  speech 
recognition:  the  variation  in  the  speed  of  speech.  The 
solution  shown  in  the  figure,  which  reverts  to  sequential 
digital  logic,  is  certain  to  be  one  of  the  weakest  links 
of the system. A single-bit error, such as the failure
or inappropriate activation of the "phoneme recognized"
or "syltype recognized" signals, would terminate
successful recognition. However, there are modifications of the
proposed  outline  that  will  ameliorate  this  weakness.  One 
possibility  is  the  operation  of  several  recognition  banks 
in  parallel,  activated  by  recognized  signals  of  various 
degrees  of  sensitivity.  Another  is  the  design  of 
special-purpose  neural  networks  that  will  be  trained  to 
provide  the  latching  internal  to  the  network. 

The  third  neural  module  in  Figure  1  is  the  isolated 
word  recognition  module.  The  input  is  a  sequence  of  syl- 
types,  latched  in  the  same  manner  as  the  phonemes  are  at 
the  input  to  the  syltype  recognition  unit.  The  word 
recognition  unit  requires  only  a  ten-bit  output  for  a 
1000-word  vocabulary.  A  27-bit  output  will  provide 
three-bit  error  correction.  To  enable  word  recognition 
at  the  third  neural  module,  the  particular  phonemes  used 
in  the  syltype,  that  is,  the  exact  syllable,  should  be 
employed.  Therefore,  the  path  from  the  phoneme  unit  to 
the  word  recognition  unit  is  provided  as  well. 

As  it  is  shown  in  Figure  1,  the  continuous  speech 
recognition  module  takes  its  input  from  the  isolated  word 
recognition  module.  This  is  an  oversimplification.  The 
greatest  difficulty  of  continuous  speech  recognition  is 
precisely  that  words  cannot  be  isolated.  Even  if  the  word 
boundaries  were  marked,  a  unit  that  recognizes  words  in 
isolation  would  not  necessarily  recognize  the  word  as  it 
appears  in  continuous  speech,  because  of  the  modification 
of  the  beginning  and  ending  phonemes  to  merge  with  those 
of  the  adjacent  words,  and  the  variations  in  accent  and 
intonation induced by the context. However, a neural
network  that  has  been  conditioned  to  recognize  words  in 
isolation  would  be  expected  to  provide  some  response  on 
the  "word  recognized"  output  line,  as  syllables  and 
phonemes  are  presented  to  it,  even  if  it  does  not  exceed 
the  threshold  required  for  recognition. 

Thus,  the  scheme  shown  in  Figure  3  is  proposed.  It 
uses  the  partial  recognition  capability  of  isolated  word 
recognition  modules  in  conjunction  with  a  network  that  is 
conditioned  only  in  the  continuous  speech  recognition 
mode.  Each  of  the  three  isolated  word  recognition 
modules  is  operated  in  parallel,  to  receive  the  identical 
modifications  to  the  link  weights,  when  operated  in  the 
isolated  word  (training)  mode.  In  the  continuous  speech 
recognition  mode,  the  elements  recognized  at  lower  levels 
(syltypes  and  phonemes)  are  latched  at  the  input  to  the 
lower  isolated  word  recognition  module.  Upon  some 
activity of the "isolated word recognized" signal, the
entire  group  of  latched  inputs  is  shifted  up  to  the  next 
isolated  word  module.  Then  further  syltypes  and  phonemes 
are  presented  to  the  lower  module,  and  both  inputs  are 
shifted  up  upon  some  combined  activity  of  the  "isolated 
word  recognized"  signals. 

The  continuous  speech  recognition  module  is  the  net¬ 
work  of  cells  that  fills  the  areas  of  Figure  3  that  are 
not  occupied  by  an  isolated  word  recognition  unit.  It 
takes  its  input  from  the  results  of  each  of  the  isolated 
word  recognition  modules,  as  well  as  the  inputs  to  those 
modules.  In  the  training  mode,  continuous  speech  module 
links  are  modified  by  correct  recognition  of  sequences  of 
words.  The  isolated  word  units  receive  a  lesser 
modification during continuous speech training. The
output of the continuous speech module is derived primarily
from the center isolated word recognition module. But it
is also influenced by cells that are conditioned by the
preceding and following words, and the phonemes of the
transition between them. Thus, the continuous speech
recognition module is trained to respond precisely to the
differences between words as they appear in isolation,
and as they appear in continuous speech.

4.  CONCLUSIONS

The  research  papers  that  were  reviewed  showed  that 
neural  network  models  can  be  defined  in  a  variety  of 
ways,  but  that  networks  with  lateral  inhibition  seem  to 
offer  greater  discrimination  capability  per  network  cell. 
From Hopfield's model [1], one would expect that a module
that can recognize 40 phonemes at a low error rate would
have at least 400 cells. But the experiments by Rumelhart
and Zipser [7], using competitive learning (a
form  of  lateral  inhibition),  indicate  that  40  distinct 
and  fixed  patterns  could  be  recognized  with  just  40 
cells.  Because  of  the  noisy  nature  of  the  data,  the 
number  of  cells  required  to  recognize  the  phonemes  in 
real  speech  will  be  greater  than  that  implied  by  these 
elementary  results.  But  with  the  benefit  of  lateral 
inhibition,  this  number  should  not  be  more  than  a  few 
thousand  at  most. 

The number of cells required in the remaining speech
recognition modules is comparable to that of the phoneme
module. Thus, the proposed model of a neural-network based
speech recognition system is amenable to development and
validation through simulation modelling. Within a few
years,  chips  will  be  available  for  the  implementation  of 
the  neural  networks.  Systems  based  on  these  hardware  com¬ 
ponents,  possibly  organized  according  to  the  outline 
given here, will be built. They will be simpler than the
sequential systems in use today, and will have superior
learning and recognition capability. Considerable work
remains  to  be  done  to  realize  this  promise. 